Introduction

This assignment focuses on applying Feature Engineering processes and Evaluation methods to a practical scenario: predicting the price of houses.

Data Reading and preparation

The dataset is provided in two separate files, one for the training set and one for the test set.

original_training_data = read.csv(file = file.path("dataset/house_price_train.csv"))
original_test_data = read.csv(file = file.path("dataset/house_price_test.csv"))

To avoid applying the Feature Engineering process twice (once for training and once for test), I will join both datasets (using the rbind function), apply the FE, and then split them again. However, if I try to join the two data frames as they are, I will get an error because they do not have the same columns: the test data does not have a price column. Therefore, I first create this column in the test set and then join the data.

original_test_data$price <- 0
dataset <- as.data.table(rbind(original_training_data, original_test_data))
class(dataset)
## [1] "data.table" "data.frame"

Let’s now visualize the dataset to see where to begin

summary(dataset)
##        id                   date           price            bedrooms     
##  Min.   :1.000e+06   6/23/2014:  142   Min.   :      0   Min.   : 1.000  
##  1st Qu.:2.123e+09   6/25/2014:  131   1st Qu.: 220000   1st Qu.: 3.000  
##  Median :3.905e+09   6/26/2014:  131   Median : 384000   Median : 3.000  
##  Mean   :4.580e+09   7/8/2014 :  127   Mean   : 431877   Mean   : 3.373  
##  3rd Qu.:7.309e+09   4/27/2015:  126   3rd Qu.: 580000   3rd Qu.: 4.000  
##  Max.   :9.900e+09   3/25/2015:  123   Max.   :7700000   Max.   :33.000  
##                      (Other)  :20817                                     
##    bathrooms      sqft_living       sqft_lot           floors     
##  Min.   :0.500   Min.   :  370   Min.   :    520   Min.   :1.000  
##  1st Qu.:1.750   1st Qu.: 1430   1st Qu.:   5040   1st Qu.:1.000  
##  Median :2.250   Median : 1910   Median :   7618   Median :1.500  
##  Mean   :2.116   Mean   : 2080   Mean   :  15099   Mean   :1.494  
##  3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10685   3rd Qu.:2.000  
##  Max.   :8.000   Max.   :13540   Max.   :1651359   Max.   :3.500  
##                                                                   
##    waterfront            view          condition        grade       
##  Min.   :0.000000   Min.   :0.0000   Min.   :1.00   Min.   : 3.000  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.00   1st Qu.: 7.000  
##  Median :0.000000   Median :0.0000   Median :3.00   Median : 7.000  
##  Mean   :0.007547   Mean   :0.2343   Mean   :3.41   Mean   : 7.658  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.00   3rd Qu.: 8.000  
##  Max.   :1.000000   Max.   :4.0000   Max.   :5.00   Max.   :13.000  
##                                                                     
##    sqft_above   sqft_basement       yr_built     yr_renovated    
##  Min.   : 370   Min.   :   0.0   Min.   :1900   Min.   :   0.00  
##  1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.00  
##  Median :1560   Median :   0.0   Median :1975   Median :   0.00  
##  Mean   :1789   Mean   : 291.7   Mean   :1971   Mean   :  84.46  
##  3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997   3rd Qu.:   0.00  
##  Max.   :9410   Max.   :4820.0   Max.   :2015   Max.   :2015.00  
##                                                                  
##     zipcode           lat             long        sqft_living15 
##  Min.   :98001   Min.   :47.16   Min.   :-122.5   Min.   : 399  
##  1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490  
##  Median :98065   Median :47.57   Median :-122.2   Median :1840  
##  Mean   :98078   Mean   :47.56   Mean   :-122.2   Mean   :1987  
##  3rd Qu.:98118   3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360  
##  Max.   :98199   Max.   :47.78   Max.   :-121.3   Max.   :6210  
##                                                                 
##    sqft_lot15    
##  Min.   :   651  
##  1st Qu.:  5100  
##  Median :  7620  
##  Mean   : 12758  
##  3rd Qu.: 10083  
##  Max.   :871200  
## 

Data Cleaning

In this section I am going to perform some data cleaning.

The features id and date are not going to offer any advantage for prediction, so we drop them.

dataset[, c('id', 'date'):=NULL]
colnames(dataset)
##  [1] "price"         "bedrooms"      "bathrooms"     "sqft_living"  
##  [5] "sqft_lot"      "floors"        "waterfront"    "view"         
##  [9] "condition"     "grade"         "sqft_above"    "sqft_basement"
## [13] "yr_built"      "yr_renovated"  "zipcode"       "lat"          
## [17] "long"          "sqft_living15" "sqft_lot15"

Factorize features

If we go back to the summary of the dataset we can identify some numerical features that are actually categories. We have to convert them to the proper ‘class’ or ‘type’ using the factor function.

dataset$condition <- factor(dataset$condition)
dataset$grade <- factor(dataset$grade)
dataset$zipcode <- factor(dataset$zipcode)
dataset$yr_built <- factor(dataset$yr_built)
dataset$yr_renovated <- factor(dataset$yr_renovated)
dataset$view <- factor(dataset$view)

#dataset$date <- as.Date.character(dataset$date, '%m/%d/%Y')

# let's now turn character columns into factors
dataset[ ,names(dataset)[sapply(dataset, is.character)]:=lapply(.SD,as.factor),
           .SDcols = names(dataset)[sapply(dataset, is.character)]]
## Warning in `[.data.table`(dataset, , `:=`(names(dataset)[sapply(dataset, :
## length(LHS)==0; no columns to delete or assign RHS to.

Hunting NAs

Let’s check if there are NAs in the dataset

#Counting columns with null values.
na.cols <- which(colSums(is.na(dataset)) > 0)
paste('There are', length(na.cols), 'columns with missing values')
## [1] "There are 0 columns with missing values"
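There are no missing values in this dataset, but if there were, a simple median imputation for numeric columns could look like the following toy sketch (not part of the original pipeline):

```r
# toy data frame with a missing value in a numeric column
toy <- data.frame(sqft = c(1430, NA, 1910, 2550), beds = c(3, 3, 2, 4))

# replace NAs in each numeric column with that column's median
for (col in names(toy)) {
  if (is.numeric(toy[[col]])) {
    toy[[col]][is.na(toy[[col]])] <- median(toy[[col]], na.rm = TRUE)
  }
}
toy$sqft  # the NA is now the median of the observed values, 1910
```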

Outliers

We will now focus on numerical values. In this section I will detect numerical features which present outliers and I will clip those values.

# Classify all numeric columns
column_types <- sapply(names(dataset), function(x) {
    class(dataset[[x]])
  }
)
numeric_columns <- names(column_types[column_types != "factor"])

# Identify the features with outliers
outliers <- scores(dataset[,numeric_columns, with=FALSE], type="chisq", prob=0.9)
per_outliers <- sort(((colSums(outliers)/length(dataset$price))*100), decreasing = TRUE)
col_outliers_to_remove <- names(per_outliers[per_outliers > 5 & names(per_outliers) != 'lat' 
                                             & names(per_outliers) != 'long'])

# Plot the boxplots of the selected variables
for (i in col_outliers_to_remove){
  boxplot(dataset[,i, with=FALSE], type="p", xlab=i)
}

# clipping the outliers
for (i in col_outliers_to_remove){
  qnt <- quantile(dataset[[i]], probs=c(.25, .75))
  caps <- quantile(dataset[[i]], probs=c(.05, .95))
  H <- 1.5 * IQR(dataset[[i]])
  dataset[[i]][dataset[[i]] < (qnt[1] - H)] <- caps[1]
  dataset[[i]][dataset[[i]] > (qnt[2] + H)] <- caps[2]
}

# Plot the box plots again to verify that the procedure has been successful
for (i in col_outliers_to_remove){
  boxplot(dataset[,i, with=FALSE], type="p", xlab=i)
}
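The capping rule above (Tukey fences at 1.5 × IQR, replacement with the 5th/95th percentiles) can be checked on a toy vector:

```r
v <- c(1:10, 100)                         # 100 is an obvious outlier
qnt  <- quantile(v, probs = c(.25, .75))  # first and third quartiles
caps <- quantile(v, probs = c(.05, .95))  # replacement values
H <- 1.5 * IQR(v)                         # Tukey fence width
v[v < (qnt[1] - H)] <- caps[1]
v[v > (qnt[2] + H)] <- caps[2]
max(v)  # the outlier has been capped at the 95th percentile, 55
```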

Skewness

We now need to detect skewness in the target variable. Let’s see the effect of skewness on a variable by plotting it with ggplot. The way to get rid of the skewness is to take the log of the feature’s values, which flattens the distribution.

# plot the skewness of the target variable
df <- rbind(data.frame(version="price",x=original_training_data$price),
            data.frame(version="log(price+1)",x=log(original_training_data$price + 1)))

ggplot(data=df) +
  facet_wrap(~version,ncol=2,scales="free_x") +
  geom_histogram(aes(x=x), bins = 50)

We therefore transform the target variable by applying the log.

The same “skewness” observed in the target variable also affects other variables. To facilitate the application of the regression models, we will eliminate this skewness as well: for every numeric feature with excessive skewness, we perform a log transformation.

# threshold above which a numeric feature's skewness triggers the log transformation
skewness_threshold = 1

Now, let’s compute the skewness of each feature that is not ‘factor’ nor ‘character’.

column_types <- sapply(names(dataset), function(x) {
    class(dataset[[x]])
  }
)
numeric_columns <- names(column_types[column_types != "factor"])
numeric_columns <- numeric_columns[numeric_columns != 'lat' & numeric_columns != 'long']

And now, with that information, we calculate the skewness of each column in our list of numeric (non-factor) features.

# skew of each variable
skew <- sapply(numeric_columns, function(x) { 
    e1071::skewness(dataset[[x]], na.rm = T)
  }
)

Now we apply the log transformation to those features whose absolute skewness is above the threshold, which we set to 1.

# transform all variables above a threshold skewness.
skew <- skew[abs(skew) > skewness_threshold]
for(x in names(skew)) {
  dataset[[x]] <- log(dataset[[x]] + 1)
}
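To see why the log helps, we can compare the skewness of a right-skewed sample before and after the log(x + 1) transformation; skew_coef below is a hand-rolled moment skewness so the example needs no extra packages (e1071::skewness uses a slightly different denominator):

```r
# simple moment skewness: third central moment over sd cubed
skew_coef <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(42)
z <- exp(rnorm(10000))   # log-normal sample: strongly right-skewed
skew_coef(z)             # large positive value
skew_coef(log(z + 1))    # much closer to 0 after the transformation
```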

Feature Creation

In this section I will create some new features to improve the predictive power of the dataset.

# having a look at the columns
colnames(dataset)
##  [1] "price"         "bedrooms"      "bathrooms"     "sqft_living"  
##  [5] "sqft_lot"      "floors"        "waterfront"    "view"         
##  [9] "condition"     "grade"         "sqft_above"    "sqft_basement"
## [13] "yr_built"      "yr_renovated"  "zipcode"       "lat"          
## [17] "long"          "sqft_living15" "sqft_lot15"
# creating a total square feet feature
dataset$TotalSqFeet <- as.numeric(dataset$sqft_basement + dataset$sqft_above + dataset$sqft_lot + dataset$sqft_living)
# Creating a Remod variable (0 = No Remodeling, 1 = Remodeling)
dataset$Remod <- ifelse(as.numeric(dataset$yr_built)==as.numeric(dataset$yr_renovated), 0, 1)
dataset$Remod <- factor(dataset$Remod) 
# We will create a label that aggregates into "others" those grades with less than 1% share
niche_grade<-names(which(summary(dataset$grade)/nrow(dataset)<0.01))
niche_grade
## [1] "3"  "4"  "12" "13"
dataset[, grade_agg:=as.factor(ifelse(grade%in%niche_grade,'others',as.character(grade)))]

summary(dataset$grade)/nrow(dataset)
##            3            4            5            6            7 
## 4.630273e-05 1.250174e-03 1.120526e-02 9.436496e-02 4.155207e-01 
##            8            9           10           11           12 
## 2.808260e-01 1.210816e-01 5.250729e-02 1.847479e-02 4.120943e-03 
##           13 
## 6.019355e-04
summary(dataset$grade_agg)/nrow(dataset)
##          10          11           5           6           7           8 
## 0.052507293 0.018474788 0.011205260 0.094364958 0.415520674 0.280826041 
##           9      others 
## 0.121081632 0.006019355
sum(summary(dataset$grade_agg)/nrow(dataset)) 
## [1] 1
dataset[, length(levels(grade_agg))]
## [1] 8
dataset[, length(levels(grade))]
## [1] 11
dataset[, length(levels(grade_agg))/length(levels(grade))-1] # important reduction in factor categories
## [1] -0.2727273
dataset[, grade_agg:=factor(grade_agg, levels=names(sort(summary(dataset$grade_agg), dec=T)))]
p<-ggplot(dataset, aes(x=grade_agg))+geom_bar(stat='count')+
  theme(axis.text.x = element_text(angle=45))
p

# we drop off the former grade variable
dataset <- dataset[, grade:=NULL]
# Now lets check the number of categories per factor variable
factor_variables<-names(dataset)[sapply(dataset, is.factor)]
count_factor_variables<-sapply(dataset[,factor_variables, with=F], summary)
count_factor_variables
## $view
##     0     1     2     3     4 
## 19475   332   961   510   319 
## 
## $condition
##     1     2     3     4     5 
##    29   170 14020  5677  1701 
## 
## $yr_built
##    2014    2006    2005    2004    2003    1977    2007    1978    1968 
##     559     453     450     433     420     417     417     387     381 
##    2008    1967    1979    1959    1990    1962    1954    2001    1987 
##     367     350     343     334     317     312     305     305     294 
##    1989    1969    1955    1988    1947    1999    1963    1976    1950 
##     290     280     271     270     263     263     255     253     249 
##    1966    1994    1960    1980    1998    1948    2009    1951    1984 
##     249     249     248     240     239     234     230     229     229 
##    1985    1958    1961    1991    1942    1953    2002    1952    2000 
##     227     224     224     224     223     222     222     220     218 
##    1986    1983    1993    2013    1981    1956    1957    1992    1949 
##     215     212     202     201     199     198     198     198     195 
##    1996    1975    1965    1926    1997    1964    1943    2012    1995 
##     194     189     187     180     177     172     170     170     169 
##    1925    1974    1941    1940    1972    1973    2010    1944    1924 
##     165     162     161     156     149     149     143     140     139 
##    1910    1970    2011    1928    1946    1918    1927    1929    1939 
##     134     132     130     126     126     120     115     114     106 
##    1982    1971    1920    1922    1945    1909    1906    1930    1919 
##     105     104      98      95      95      94      92      90      88 
##    1900    1908    1923    1912    1916    1921    1905    1911    1937 
##      87      86      84      79      79      76      74      73      68 
## (Other) 
##     748 
## 
## $yr_renovated
##     0  1934  1940  1944  1945  1946  1948  1950  1951  1953  1954  1955 
## 20683     1     2     1     3     2     1     2     1     3     1     3 
##  1956  1957  1958  1959  1960  1962  1963  1964  1965  1967  1968  1969 
##     3     3     5     1     4     2     4     5     5     2     8     4 
##  1970  1971  1972  1973  1974  1975  1976  1977  1978  1979  1980  1981 
##     9     2     4     5     3     6     3     8     6    10    11     5 
##  1982  1983  1984  1985  1986  1987  1988  1989  1990  1991  1992  1993 
##    11    18    18    17    17    18    15    22    25    20    17    19 
##  1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005 
##    19    16    15    15    19    17    35    19    22    36    26    35 
##  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015 
##    24    35    18    22    18    13    11    37    91    16 
## 
## $zipcode
## 98001 98002 98003 98004 98005 98006 98007 98008 98010 98011 98014 98019 
##   361   199   280   317   168   498   141   283   100   195   124   190 
## 98022 98023 98024 98027 98028 98029 98030 98031 98032 98033 98034 98038 
##   233   499    80   412   283   321   256   273   125   432   545   589 
## 98039 98040 98042 98045 98052 98053 98055 98056 98058 98059 98065 98070 
##    50   282   547   220   574   403   268   406   455   468   308   117 
## 98072 98074 98075 98077 98092 98102 98103 98105 98106 98107 98108 98109 
##   273   441   359   198   351   104   602   229   335   266   186   109 
## 98112 98115 98116 98117 98118 98119 98122 98125 98126 98133 98136 98144 
##   269   583   330   553   507   184   290   409   354   493   263   343 
## 98146 98148 98155 98166 98168 98177 98178 98188 98198 98199 
##   288    57   446   254   269   255   262   136   280   317 
## 
## $Remod
##     0     1 
##    84 21513 
## 
## $grade_agg
##      7      8      9      6     10     11      5 others 
##   8974   6065   2615   2038   1134    399    242    130
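The rare-level aggregation used to build grade_agg generalizes to any factor; a minimal base-R sketch on a toy factor:

```r
# toy factor with two dominant levels and two rare ones
g <- factor(c(rep("7", 90), rep("8", 95), "3", "13"))

# levels with less than 1% share are relabelled as "others"
share <- table(g) / length(g)
rare  <- names(share[share < 0.01])
g_agg <- factor(ifelse(g %in% rare, "others", as.character(g)))
levels(g_agg)  # "7" "8" "others"
```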

Modelling

Train, Validation Spliting

To facilitate the data cleaning and feature engineering we merged train and test datasets. We now split them again to create our final model.

training_data <- dataset[which(dataset$price!=0),]
test <- dataset[which(dataset$price==0),]

We are going to split the annotated dataset into training and validation sets for the later evaluation of our regression models.

splits <- splitdf(training_data)
training <- splits$trainset
validation <- splits$testset
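splitdf is a helper provided with the course materials and not shown here; it presumably performs a random row split along these lines. The following is a hypothetical re-implementation (the name split_df, the 80/20 ratio, and the fixed seed are my assumptions, not the original helper):

```r
# hypothetical re-implementation of a splitdf-style helper (assumed behaviour)
split_df <- function(df, train_fraction = 0.8, seed = 1) {
  set.seed(seed)
  idx <- sample(nrow(df), size = floor(train_fraction * nrow(df)))
  list(trainset = df[idx, , drop = FALSE],
       testset  = df[-idx, , drop = FALSE])
}

toy_splits <- split_df(data.frame(price = 1:100))
c(nrow(toy_splits$trainset), nrow(toy_splits$testset))  # 80 and 20 rows
```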

Feature Selection

We here start the Feature Selection.

Full Model

Let’s try first a baseline including all the features to evaluate the impact of the feature engineering.

lm.model(training, validation, "Baseline")

Information Gain Selection

Let’s use information gain to select the most important variables. In this section I will also use a cross-validation procedure to select the best threshold for my model. Since the results are better than the baseline, from now on I will use ig_training for all the models.

# applying information gain with the optimal threshold found
weights<- data.frame(information.gain(formula, training))
weights$feature <- rownames(weights)
weights[order(weights$attr_importance, decreasing = TRUE),]
##               attr_importance       feature
## zipcode          0.4078836296       zipcode
## grade_agg        0.2888902300     grade_agg
## lat              0.2771141814           lat
## sqft_living      0.2733019702   sqft_living
## TotalSqFeet      0.2505272291   TotalSqFeet
## sqft_living15    0.2182127778 sqft_living15
## sqft_above       0.1955334775    sqft_above
## bathrooms        0.1437739598     bathrooms
## yr_built         0.0665257095      yr_built
## sqft_lot15       0.0618703528    sqft_lot15
## bedrooms         0.0593158206      bedrooms
## floors           0.0572538468        floors
## long             0.0527523602          long
## view             0.0486812688          view
## sqft_lot         0.0485797797      sqft_lot
## sqft_basement    0.0421027350 sqft_basement
## yr_renovated     0.0184476174  yr_renovated
## condition        0.0084169301     condition
## waterfront       0.0072894744    waterfront
## Remod            0.0004652503         Remod
information_gain_features <- weights$feature[weights$attr_importance > information_gain_CV(training, validation,
                                                                                           int_min = 0.002, 
                                                                                           int_max = 0.006, 
                                                                                           steps = 0.01)]

ig_training <- training[,c(information_gain_features,"price"),with=FALSE]
ig_test <- test[,c(information_gain_features,"price"),with=FALSE]
ig_validation <- validation[,c(information_gain_features,"price"),with=FALSE]

lm.model(ig_training, ig_validation, "Information Gain")

Ridge Regression

Let’s try the Ridge Regression

lambdas <- 10^seq(-3, 0, by = .05)

set.seed(121)
train_control_config <- trainControl(method = "repeatedcv", 
                                     number = 5, 
                                     repeats = 1,
                                     returnResamp = "all")

ridge.mod <- train(formula, data = ig_training, 
               method = "glmnet", 
               metric = "RMSE",
               trControl=train_control_config,
               tuneGrid = expand.grid(alpha = 0, lambda = lambdas))

Plotting the RMSE for the different lambda values, we can see the impact of this parameter in the model performance. Small values seem to work better for this dataset.

plot(ridge.mod)

Plotting the coefficients for different lambda values: as expected, the larger the lambda (the lower the norm), the smaller the feature coefficients. However, as we can see at the top of the plot, there is no feature selection; i.e., the model always considers all 225 parameters.

plot(ridge.mod$finalModel)
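This shrinkage-without-selection behaviour can be reproduced with the closed-form ridge solution on toy data (a sketch of the underlying math, not of glmnet's actual algorithm):

```r
set.seed(1)
X <- matrix(rnorm(100 * 3), nrow = 100)
y <- X %*% c(2, -1, 0.5) + rnorm(100)

# ridge estimate: (X'X + lambda * I)^-1 X'y
ridge_coef <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}

beta_small <- ridge_coef(X, y, 0.01)
beta_large <- ridge_coef(X, y, 100)
# larger lambda shrinks the coefficient norm, but no coefficient becomes exactly zero
sum(beta_large^2) < sum(beta_small^2)
all(abs(beta_large) > 0)
```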

# evaluating the model 
ridge.mod.pred <- predict(ridge.mod, ig_validation)
ridge.mod.pred[is.na(ridge.mod.pred)] <- 0

my_data <- as.data.frame(cbind(predicted=(exp(ridge.mod.pred) -1), observed=(exp(validation$price) -1)))
ridge.mod.mape <- mape(ig_validation$price, ridge.mod.pred)
ridge.mod.price_error <- mean(abs((exp(ridge.mod.pred) -1) - (exp(validation$price) -1)))

# plotting the results
ggplot(my_data, aes(predicted, observed)) +
    geom_point() + geom_smooth(method = "glm") +
    labs(x="Predicted") +
    ggtitle(paste("Ridge", ' MAPE: ', format(round(ridge.mod.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(ridge.mod.price_error, 0), nsmall=0), 
                  '€', sep='')) +  
    scale_x_continuous(labels = scales::comma) + 
    scale_y_continuous(labels = scales::comma)

Rank the variables according to the importance attributed by the model.

# Print, plot variable importance
plot(varImp(ridge.mod), top = 20) # 20 most important features

Lasso Regression

Now let’s try lasso in combination with information gain. Moreover, I will use cross-validation to identify the best seed.

# applying the lasso regression with the optimal seed value found above.
lambdas <- 10^seq(-3, 0, by = .05)

set.seed(lasso_CV(training, validation,10,60,10))
train_control_config <- trainControl(method = "repeatedcv", 
                                     number = 5, 
                                     repeats = 1,
                                     returnResamp = "all")

lasso.mod <- train(formula, data = ig_training, 
               method = "glmnet", 
               metric = "RMSE",
               trControl=train_control_config,
               tuneGrid = expand.grid(alpha = 1, lambda = lambdas))

Plotting the RMSE for the different lambda values, we can see the impact of this parameter in the model performance. Small values seem to work better for this dataset.

plot(lasso.mod)

Plotting the coefficients for different lambda values: as expected, the larger the lambda, the smaller the feature coefficients. Unlike ridge, lasso does perform feature selection: as lambda grows, some coefficients are driven exactly to zero, so the number of active features (shown at the top of the plot) decreases.

plot(lasso.mod$finalModel)
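Lasso’s ability to zero out coefficients comes from the soft-thresholding operator at the core of its coordinate-descent updates; a minimal sketch:

```r
# soft-thresholding: S(z, g) = sign(z) * max(|z| - g, 0)
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

soft_threshold(c(-2, -0.3, 0.1, 1.5), gamma = 0.5)
# small coefficients (-0.3, 0.1) are set exactly to zero; large ones shrink by 0.5
```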

# evaluating the model
lasso.mod.pred <- predict(lasso.mod, ig_validation)
lasso.mod.pred[is.na(lasso.mod.pred)] <- 0

my_data <- as.data.frame(cbind(predicted=(exp(lasso.mod.pred) -1), observed=(exp(validation$price) -1)))
lasso.mod.mape <- mape(validation$price, lasso.mod.pred)
lasso.mod.price_error <- mean(abs((exp(lasso.mod.pred) -1) - (exp(validation$price) -1)))

# plotting the results
ggplot(my_data, aes(predicted, observed)) +
    geom_point() + geom_smooth(method = "glm") +
    labs(x="Predicted") +
    ggtitle(paste("lasso", ' MAPE: ', format(round(lasso.mod.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(lasso.mod.price_error, 0), nsmall=0), 
                  ' €', sep='')) +  
    scale_x_continuous(labels = scales::comma) + 
    scale_y_continuous(labels = scales::comma)

Rank the variables according to the importance attributed by the model.

# Print, plot variable importance
plot(varImp(lasso.mod), top = 20) # 20 most important features

Random Forest

Now let’s try Random Forest. After running all the models with the default params, XGB was the best performing one. For this reason I will apply hyper-parameter tuning just for XGB.

# defining the model
rf_0<-ranger(formula=formula, data=ig_training)
rf.pred<-predict(rf_0, data = ig_validation)$predictions

# evaluation
rf.mape <- mape(validation$price, rf.pred)
rf.price_error <- mean(abs((exp(rf.pred) -1) - (exp(validation$price) -1)))
my_data <- as.data.frame(cbind(predicted=(exp(rf.pred) -1), observed=(exp(validation$price) -1)))

# plotting the results
ggplot(my_data, aes(predicted, observed)) +
    geom_point() + geom_smooth(method = "glm") +
    labs(x="Predicted") +
    ggtitle(paste("Random Forest", ' MAPE: ', format(round(rf.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(rf.price_error, 0), nsmall=0), 
                  ' €', sep='')) +  
    scale_x_continuous(labels = scales::comma) + 
    scale_y_continuous(labels = scales::comma)

XGBoost

In this section we’ll use XGBoost. First, we have to transform the data so that xgb can work with it: we need to create dummy variables and cast the data into a matrix.

xgb_ig_training <- caret::dummyVars(formula= ~., data = ig_training, fullRank=T,sep = "_")
xgb_ig_training <- data.table(predict(xgb_ig_training, newdata = ig_training))
xgb_ig_training <- as.matrix(xgb_ig_training[, !'price', with=F])

xgb_ig_validation <- caret::dummyVars(formula= ~., data = ig_validation, fullRank=T,sep = "_")
xgb_ig_validation <- data.table(predict(xgb_ig_validation, newdata = ig_validation))
xgb_ig_validation <- as.matrix(xgb_ig_validation[, !'price', with=F])

xgb_ig_test <- caret::dummyVars(formula= ~., data = ig_test, fullRank=T,sep = "_")
xgb_ig_test <- data.table(predict(xgb_ig_test, newdata = ig_test))
xgb_ig_test <- as.matrix(xgb_ig_test[, !'price', with=F])
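What caret::dummyVars produces is essentially a full-rank one-hot encoding; base R’s model.matrix does something very similar, which we can check on a toy frame:

```r
toy <- data.frame(price = c(10, 20, 30),
                  view  = factor(c("0", "2", "0")))

# model.matrix expands factors into dummy columns; [, -1] drops the intercept
m <- model.matrix(~ ., data = toy)[, -1]
colnames(m)  # "price" plus one dummy column "view2" (the first level is the baseline)
```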

Now let’s define our first model using all the default params

# defining the model
xgb_0<-xgboost(booster='gbtree',
               data=xgb_ig_training,
               label= training$price,
               nrounds = 100,
               objective='reg:linear')
## [1]  train-rmse:8.797826 
## [2]  train-rmse:6.163718 
## [3]  train-rmse:4.320491 
## [4]  train-rmse:3.031123 
## [5]  train-rmse:2.129980 
## [6]  train-rmse:1.500232 
## [7]  train-rmse:1.061951 
## [8]  train-rmse:0.758129 
## [9]  train-rmse:0.549899 
## [10] train-rmse:0.408636 
## [11] train-rmse:0.313780 
## [12] train-rmse:0.255180 
## [13] train-rmse:0.219143 
## [14] train-rmse:0.196475 
## [15] train-rmse:0.183525 
## [16] train-rmse:0.176495 
## [17] train-rmse:0.170329 
## [18] train-rmse:0.166563 
## [19] train-rmse:0.164260 
## [20] train-rmse:0.162487 
## [21] train-rmse:0.160702 
## [22] train-rmse:0.158476 
## [23] train-rmse:0.157014 
## [24] train-rmse:0.155735 
## [25] train-rmse:0.154442 
## [26] train-rmse:0.153304 
## [27] train-rmse:0.152436 
## [28] train-rmse:0.151475 
## [29] train-rmse:0.149598 
## [30] train-rmse:0.148630 
## [31] train-rmse:0.147728 
## [32] train-rmse:0.146997 
## [33] train-rmse:0.145515 
## [34] train-rmse:0.144383 
## [35] train-rmse:0.143948 
## [36] train-rmse:0.143151 
## [37] train-rmse:0.141956 
## [38] train-rmse:0.141490 
## [39] train-rmse:0.140731 
## [40] train-rmse:0.140127 
## [41] train-rmse:0.138974 
## [42] train-rmse:0.138424 
## [43] train-rmse:0.138132 
## [44] train-rmse:0.137885 
## [45] train-rmse:0.137498 
## [46] train-rmse:0.136887 
## [47] train-rmse:0.136329 
## [48] train-rmse:0.135181 
## [49] train-rmse:0.134821 
## [50] train-rmse:0.134519 
## [51] train-rmse:0.133952 
## [52] train-rmse:0.133598 
## [53] train-rmse:0.133463 
## [54] train-rmse:0.132937 
## [55] train-rmse:0.132142 
## [56] train-rmse:0.131602 
## [57] train-rmse:0.130801 
## [58] train-rmse:0.130180 
## [59] train-rmse:0.130006 
## [60] train-rmse:0.129591 
## [61] train-rmse:0.129177 
## [62] train-rmse:0.128540 
## [63] train-rmse:0.128336 
## [64] train-rmse:0.127511 
## [65] train-rmse:0.127066 
## [66] train-rmse:0.126900 
## [67] train-rmse:0.126551 
## [68] train-rmse:0.126350 
## [69] train-rmse:0.125934 
## [70] train-rmse:0.125443 
## [71] train-rmse:0.125126 
## [72] train-rmse:0.124738 
## [73] train-rmse:0.124470 
## [74] train-rmse:0.124130 
## [75] train-rmse:0.123966 
## [76] train-rmse:0.123661 
## [77] train-rmse:0.123295 
## [78] train-rmse:0.122729 
## [79] train-rmse:0.122505 
## [80] train-rmse:0.121792 
## [81] train-rmse:0.121512 
## [82] train-rmse:0.121117 
## [83] train-rmse:0.120924 
## [84] train-rmse:0.120695 
## [85] train-rmse:0.120372 
## [86] train-rmse:0.119882 
## [87] train-rmse:0.119450 
## [88] train-rmse:0.119280 
## [89] train-rmse:0.119183 
## [90] train-rmse:0.118754 
## [91] train-rmse:0.118551 
## [92] train-rmse:0.118238 
## [93] train-rmse:0.117803 
## [94] train-rmse:0.117400 
## [95] train-rmse:0.117203 
## [96] train-rmse:0.116636 
## [97] train-rmse:0.116284 
## [98] train-rmse:0.116028 
## [99] train-rmse:0.115786 
## [100]    train-rmse:0.115674
xgb.pred<-predict(xgb_0, newdata = xgb_ig_validation, type='response')
# evaluation
xgb.mape <- mape(validation$price, xgb.pred)
xgb.price_error <- mean(abs((exp(xgb.pred) -1) - (exp(validation$price) -1)))
my_data <- as.data.frame(cbind(predicted=(exp(xgb.pred) -1), observed=(exp(validation$price) -1)))

# Plotting the results
ggplot(my_data, aes(predicted, observed)) +
    geom_point() + geom_smooth(method = "glm") +
    labs(x="Predicted") +
    ggtitle(paste("XGBoost", ' MAPE: ', format(round(xgb.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(xgb.price_error, 0), nsmall=0), 
                  ' €', sep='')) +  
    scale_x_continuous(labels = scales::comma) + 
    scale_y_continuous(labels = scales::comma)

XGB Optimized

To get better performance I will use grid search to find the optimal values of the model’s hyperparameters.

searchGridSubCol <- expand.grid(subsample = c(0.9, 1), 
                                colsample_bytree = c(0.5, 0.6),
                                max_depth = c(8, 9),
                                min_child = seq(1), 
                                eta = c(0.1, 0.2),
                                gamma = c(1)
)

ntrees <- 100

system.time(
rmseErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
  
  #Extract Parameters to test
  currentSubsampleRate <- parameterList[["subsample"]]
  currentColsampleRate <- parameterList[["colsample_bytree"]]
  currentDepth <- parameterList[["max_depth"]]
  currentEta <- parameterList[["eta"]]
  currentMinChild <- parameterList[["min_child"]]
  currentgamma <- parameterList[["gamma"]]
  xgboostModelCV <- xgb.cv(data =  xgb_ig_training, 
                           label = training$price, 
                           nrounds = ntrees, 
                           nfold = 5, 
                           showsd = TRUE, 
                           metrics = "rmse", 
                           verbose = TRUE, 
                           "eval_metric" = "rmse",
                           "objective" = "reg:linear", 
                           "max.depth" = currentDepth,
                           "eta" = currentEta, 
                           "gamma" = currentgamma,
                           "subsample" = currentSubsampleRate, 
                           "colsample_bytree" = currentColsampleRate, 
                           print_every_n = 10, 
                           "min_child_weight" = currentMinChild, 
                           booster = "gbtree",
                           early_stopping_rounds = 10)
  
  xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
  rmse <- tail(xvalidationScores$test_rmse_mean, 1)
  trmse <- tail(xvalidationScores$train_rmse_mean,1)
  output <- return(c(rmse, trmse, currentSubsampleRate, currentColsampleRate, currentDepth, currentEta, currentMinChild))}))
## [1]  train-rmse:11.307232+0.003575   test-rmse:11.306831+0.013733 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:3.956090+0.001376    test-rmse:3.955863+0.008751 
## [21] train-rmse:1.401310+0.000663    test-rmse:1.402677+0.005352 
## [31] train-rmse:0.525391+0.001488    test-rmse:0.528876+0.002554 
## [41] train-rmse:0.253110+0.001386    test-rmse:0.260110+0.003505 
## [51] train-rmse:0.190346+0.001897    test-rmse:0.199892+0.003959 
## [61] train-rmse:0.177906+0.001261    test-rmse:0.188094+0.003199 
## [71] train-rmse:0.174978+0.000904    test-rmse:0.185344+0.002781 
## [81] train-rmse:0.174127+0.000669    test-rmse:0.184627+0.002715 
## [91] train-rmse:0.173718+0.000520    test-rmse:0.184250+0.002643 
## [100]    train-rmse:0.173216+0.000468    test-rmse:0.183770+0.002804 
## [1]  train-rmse:11.307014+0.001923   test-rmse:11.307285+0.008602 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:3.956350+0.001291    test-rmse:3.956666+0.009927 
## [21] train-rmse:1.400062+0.000996    test-rmse:1.401425+0.009593 
## [31] train-rmse:0.524502+0.000522    test-rmse:0.528403+0.008602 
## [41] train-rmse:0.251462+0.000951    test-rmse:0.259498+0.005381 
## [51] train-rmse:0.188306+0.000267    test-rmse:0.198668+0.003268 
## [61] train-rmse:0.176970+0.000398    test-rmse:0.188043+0.003048 
## [71] train-rmse:0.173956+0.000604    test-rmse:0.185355+0.003128 
## [81] train-rmse:0.172623+0.000752    test-rmse:0.184143+0.003641 
## [91] train-rmse:0.172154+0.000643    test-rmse:0.183682+0.003480 
## [100]    train-rmse:0.171877+0.000912    test-rmse:0.183426+0.003376 
## [1]  train-rmse:11.307141+0.002329   test-rmse:11.306845+0.009424 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:3.956319+0.000762    test-rmse:3.956198+0.007355 
## [21] train-rmse:1.399648+0.000654    test-rmse:1.400617+0.007340 
## [31] train-rmse:0.523590+0.000767    test-rmse:0.527064+0.006551 
## [41] train-rmse:0.252202+0.000842    test-rmse:0.259090+0.004965 
## [51] train-rmse:0.189118+0.001027    test-rmse:0.198630+0.003970 
## [61] train-rmse:0.177143+0.000630    test-rmse:0.187357+0.003825 
## [71] train-rmse:0.174601+0.000677    test-rmse:0.184941+0.003782 
## [81] train-rmse:0.173662+0.000597    test-rmse:0.184154+0.003641 
## [91] train-rmse:0.173413+0.000616    test-rmse:0.183914+0.003761 
## [100]    train-rmse:0.172571+0.000935    test-rmse:0.183258+0.003574 
## [1]  train-rmse:11.306962+0.001160   test-rmse:11.307312+0.004800 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:3.955661+0.000749    test-rmse:3.956050+0.003201 
## [21] train-rmse:1.398606+0.000567    test-rmse:1.399727+0.003935 
## [31] train-rmse:0.522329+0.000820    test-rmse:0.526007+0.003907 
## [41] train-rmse:0.250535+0.000899    test-rmse:0.258675+0.002386 
## [51] train-rmse:0.186684+0.000963    test-rmse:0.197348+0.001455 
## [61] train-rmse:0.176029+0.000936    test-rmse:0.187403+0.001747 
## [71] train-rmse:0.174044+0.001169    test-rmse:0.185552+0.001632 
## [81] train-rmse:0.172895+0.000835    test-rmse:0.184476+0.001924 
## [91] train-rmse:0.172101+0.000511    test-rmse:0.183745+0.002111 
## [100]    train-rmse:0.171880+0.000530    test-rmse:0.183556+0.002037 
## [1]  train-rmse:11.307011+0.002885   test-rmse:11.307003+0.011734 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:3.956411+0.001470    test-rmse:3.956651+0.010595 
## [21] train-rmse:1.400518+0.000987    test-rmse:1.402248+0.010195 
## [31] train-rmse:0.524826+0.000578    test-rmse:0.528903+0.007911 
## [41] train-rmse:0.252578+0.000574    test-rmse:0.260608+0.005307 
## [51] train-rmse:0.188710+0.001215    test-rmse:0.199616+0.004363 
## [61] train-rmse:0.177361+0.001244    test-rmse:0.189011+0.003520 
## [71] train-rmse:0.174466+0.001477    test-rmse:0.186377+0.003056 
## [81] train-rmse:0.173376+0.001306    test-rmse:0.185395+0.003263 
## [91] train-rmse:0.172685+0.001576    test-rmse:0.184792+0.003205 
## [100]    train-rmse:0.172498+0.001705    test-rmse:0.184601+0.003167 
## [1]  train-rmse:11.307183+0.001649   test-rmse:11.307488+0.006868 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:3.956694+0.000627    test-rmse:3.957390+0.006014 
## [21] train-rmse:1.399802+0.000808    test-rmse:1.401128+0.005196 
## [31] train-rmse:0.524272+0.000679    test-rmse:0.527692+0.003435 
## [41] train-rmse:0.251448+0.001029    test-rmse:0.258833+0.003931 
## [51] train-rmse:0.187942+0.001334    test-rmse:0.197844+0.005289 
## [61] train-rmse:0.176837+0.001391    test-rmse:0.187427+0.006227 
## [71] train-rmse:0.173865+0.001269    test-rmse:0.184890+0.006283 
## [81] train-rmse:0.172776+0.001204    test-rmse:0.183899+0.006178 
## [91] train-rmse:0.172077+0.000906    test-rmse:0.183290+0.006336 
## [100]    train-rmse:0.171896+0.000790    test-rmse:0.183104+0.006486 
## [1]  train-rmse:11.307043+0.001891   test-rmse:11.306707+0.007990 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:3.955851+0.000880    test-rmse:3.955356+0.004055 
## [21] train-rmse:1.399575+0.000656    test-rmse:1.400509+0.002731 
## [31] train-rmse:0.523008+0.000675    test-rmse:0.526506+0.003528 
## [41] train-rmse:0.250699+0.001370    test-rmse:0.258097+0.005178 
## [51] train-rmse:0.187693+0.001545    test-rmse:0.197632+0.004785 
## [61] train-rmse:0.175870+0.001511    test-rmse:0.186523+0.004815 
## [71] train-rmse:0.173732+0.001331    test-rmse:0.184567+0.004717 
## [81] train-rmse:0.172801+0.001120    test-rmse:0.183714+0.004375 
## [91] train-rmse:0.172224+0.001302    test-rmse:0.183230+0.004476 
## [100]    train-rmse:0.171872+0.001446    test-rmse:0.182990+0.004611 
## [1]  train-rmse:11.306971+0.001272   test-rmse:11.306683+0.005017 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:3.955715+0.000700    test-rmse:3.955347+0.003195 
## [21] train-rmse:1.398353+0.000524    test-rmse:1.399172+0.002621 
## [31] train-rmse:0.522113+0.001353    test-rmse:0.525807+0.002733 
## [41] train-rmse:0.249727+0.001702    test-rmse:0.257901+0.001878 
## [51] train-rmse:0.186344+0.000994    test-rmse:0.197200+0.002564 
## [61] train-rmse:0.175018+0.000932    test-rmse:0.186739+0.002840 
## [71] train-rmse:0.172678+0.000655    test-rmse:0.184704+0.003123 
## [81] train-rmse:0.172045+0.000882    test-rmse:0.184150+0.003314 
## [91] train-rmse:0.171488+0.000947    test-rmse:0.183688+0.003376 
## [100]    train-rmse:0.171154+0.000799    test-rmse:0.183386+0.003295 
## [1]  train-rmse:10.053293+0.001959   test-rmse:10.053207+0.007510 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:1.108818+0.001403    test-rmse:1.109944+0.004515 
## [21] train-rmse:0.222271+0.002383    test-rmse:0.231639+0.003346 
## [31] train-rmse:0.179618+0.002389    test-rmse:0.191244+0.001997 
## [41] train-rmse:0.176600+0.002033    test-rmse:0.188615+0.002456 
## [51] train-rmse:0.175030+0.001605    test-rmse:0.187260+0.003339 
## [61] train-rmse:0.174173+0.001575    test-rmse:0.186585+0.003415 
## [71] train-rmse:0.173923+0.001376    test-rmse:0.186385+0.003382 
## [81] train-rmse:0.173671+0.001199    test-rmse:0.186174+0.003288 
## [91] train-rmse:0.173430+0.001240    test-rmse:0.185923+0.003080 
## [100]    train-rmse:0.173270+0.001445    test-rmse:0.185778+0.003101 
## [1]  train-rmse:10.052837+0.001501   test-rmse:10.052746+0.008566 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:1.108060+0.000456    test-rmse:1.109155+0.004970 
## [21] train-rmse:0.222517+0.001911    test-rmse:0.231039+0.003742 
## [31] train-rmse:0.178248+0.002797    test-rmse:0.189178+0.003792 
## [41] train-rmse:0.174303+0.001093    test-rmse:0.185641+0.004381 
## [51] train-rmse:0.173027+0.001542    test-rmse:0.184653+0.004472 
## [61] train-rmse:0.172721+0.001591    test-rmse:0.184401+0.004494 
## [71] train-rmse:0.172530+0.001527    test-rmse:0.184273+0.004594 
## Stopping. Best iteration:
## [64] train-rmse:0.172530+0.001527    test-rmse:0.184273+0.004594
## 
## [1]  train-rmse:10.053011+0.001197   test-rmse:10.053136+0.007376 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:1.108900+0.001710    test-rmse:1.109776+0.006723 
## [21] train-rmse:0.222347+0.002502    test-rmse:0.230155+0.004941 
## [31] train-rmse:0.178977+0.001568    test-rmse:0.189399+0.003077 
## [41] train-rmse:0.175680+0.001386    test-rmse:0.186553+0.002764 
## [51] train-rmse:0.174520+0.001817    test-rmse:0.185486+0.002434 
## [61] train-rmse:0.173532+0.001698    test-rmse:0.184750+0.002457 
## [71] train-rmse:0.173019+0.001860    test-rmse:0.184366+0.002361 
## [81] train-rmse:0.172799+0.001803    test-rmse:0.184173+0.002472 
## [91] train-rmse:0.172492+0.001750    test-rmse:0.183942+0.002552 
## [100]    train-rmse:0.172315+0.001708    test-rmse:0.183800+0.002525 
## [1]  train-rmse:10.052717+0.001533   test-rmse:10.052326+0.008017 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:1.107835+0.000756    test-rmse:1.110334+0.004745 
## [21] train-rmse:0.218819+0.001301    test-rmse:0.228684+0.002670 
## [31] train-rmse:0.176305+0.000710    test-rmse:0.188047+0.002909 
## [41] train-rmse:0.174263+0.001311    test-rmse:0.186232+0.002891 
## [51] train-rmse:0.173381+0.001637    test-rmse:0.185312+0.003193 
## [61] train-rmse:0.172554+0.000784    test-rmse:0.184618+0.002670 
## [71] train-rmse:0.172313+0.000807    test-rmse:0.184412+0.002834 
## [81] train-rmse:0.171482+0.001151    test-rmse:0.183786+0.003130 
## [91] train-rmse:0.171110+0.001107    test-rmse:0.183509+0.003227 
## [100]    train-rmse:0.170757+0.001151    test-rmse:0.183240+0.002905 
## [1]  train-rmse:10.053125+0.001213   test-rmse:10.053862+0.007412 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:1.111033+0.002788    test-rmse:1.112406+0.003934 
## [21] train-rmse:0.223799+0.002406    test-rmse:0.232278+0.002338 
## [31] train-rmse:0.178900+0.001828    test-rmse:0.189787+0.003166 
## [41] train-rmse:0.175279+0.000961    test-rmse:0.186339+0.003881 
## [51] train-rmse:0.174262+0.001139    test-rmse:0.185435+0.004002 
## [61] train-rmse:0.173018+0.001337    test-rmse:0.184279+0.004317 
## [71] train-rmse:0.172380+0.001342    test-rmse:0.183883+0.004295 
## [81] train-rmse:0.171737+0.001261    test-rmse:0.183318+0.003670 
## [91] train-rmse:0.171670+0.001162    test-rmse:0.183258+0.003741 
## [100]    train-rmse:0.171616+0.001156    test-rmse:0.183197+0.003741 
## [1]  train-rmse:10.052551+0.001162   test-rmse:10.052608+0.006796 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:1.108106+0.001303    test-rmse:1.108728+0.003983 
## [21] train-rmse:0.218216+0.002153    test-rmse:0.226854+0.001800 
## [31] train-rmse:0.176149+0.001751    test-rmse:0.187639+0.001228 
## [41] train-rmse:0.173704+0.001528    test-rmse:0.185516+0.001598 
## [51] train-rmse:0.172177+0.000804    test-rmse:0.184470+0.002161 
## [61] train-rmse:0.171826+0.000875    test-rmse:0.184128+0.001980 
## [71] train-rmse:0.171155+0.000866    test-rmse:0.183639+0.002174 
## [81] train-rmse:0.170955+0.001063    test-rmse:0.183447+0.002096 
## [91] train-rmse:0.170574+0.001136    test-rmse:0.183086+0.001824 
## [100]    train-rmse:0.170368+0.000983    test-rmse:0.182906+0.001878 
## [1]  train-rmse:10.052856+0.001827   test-rmse:10.052921+0.008920 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:1.109055+0.001774    test-rmse:1.110953+0.005484 
## [21] train-rmse:0.219752+0.002847    test-rmse:0.229379+0.004159 
## [31] train-rmse:0.177726+0.001581    test-rmse:0.189392+0.006378 
## [41] train-rmse:0.174278+0.001516    test-rmse:0.186306+0.006723 
## [51] train-rmse:0.172554+0.000868    test-rmse:0.184645+0.007007 
## [61] train-rmse:0.172297+0.000703    test-rmse:0.184422+0.007188 
## [71] train-rmse:0.171537+0.000750    test-rmse:0.183880+0.006892 
## [81] train-rmse:0.170784+0.001085    test-rmse:0.183293+0.006652 
## [91] train-rmse:0.170454+0.001235    test-rmse:0.183013+0.006539 
## [100]    train-rmse:0.170454+0.001235    test-rmse:0.183015+0.006539 
## Stopping. Best iteration:
## [90] train-rmse:0.170454+0.001235    test-rmse:0.183011+0.006542
## 
## [1]  train-rmse:10.052655+0.001763   test-rmse:10.052597+0.007383 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:1.108083+0.000907    test-rmse:1.109058+0.004428 
## [21] train-rmse:0.217753+0.001221    test-rmse:0.227060+0.003340 
## [31] train-rmse:0.175264+0.001139    test-rmse:0.187611+0.002281 
## [41] train-rmse:0.172835+0.001630    test-rmse:0.185387+0.002990 
## [51] train-rmse:0.171869+0.001867    test-rmse:0.184524+0.002929 
## [61] train-rmse:0.171642+0.001892    test-rmse:0.184320+0.002733 
## [71] train-rmse:0.171557+0.001951    test-rmse:0.184240+0.002810 
## Stopping. Best iteration:
## [62] train-rmse:0.171556+0.001951    test-rmse:0.184239+0.002810
##     user   system  elapsed 
## 1175.041   11.685 1308.516
# finding the best combination of hyperparameters
output <- as.data.table(t(rmseErrorsHyperparameters))
varnames <- c("TestRMSE", "TrainRMSE", "SubSampRate", "ColSampRate", "Depth", "eta", "MinChild")
names(output) <- varnames
optimized_param <- output[which.min(TestRMSE), 3:7]

Now we define the model using the best hyperparameters found by the cross-validated grid search.

# Defining the model with the optimal hyperparameters
xgb_opt <- xgboost(booster = 'gbtree',
               data = xgb_ig_training,
               label = training$price,
               nrounds = ntrees,
               max.depth = optimized_param[['Depth']],
               eta = optimized_param[['eta']],
               subsample = optimized_param[['SubSampRate']],
               colsample_bytree = optimized_param[['ColSampRate']],
               min_child_weight = optimized_param[['MinChild']],
               objective = 'reg:linear')
## [1]  train-rmse:10.052924 
## [2]  train-rmse:8.045995 
## [3]  train-rmse:6.440963 
## [4]  train-rmse:5.156448 
## [5]  train-rmse:4.129259 
## [6]  train-rmse:3.307750 
## [7]  train-rmse:2.650942 
## [8]  train-rmse:2.126669 
## [9]  train-rmse:1.707042 
## [10] train-rmse:1.372038 
## [11] train-rmse:1.105457 
## [12] train-rmse:0.892457 
## [13] train-rmse:0.723715 
## [14] train-rmse:0.590428 
## [15] train-rmse:0.484095 
## [16] train-rmse:0.401033 
## [17] train-rmse:0.336576 
## [18] train-rmse:0.287551 
## [19] train-rmse:0.249214 
## [20] train-rmse:0.222039 
## [21] train-rmse:0.200967 
## [22] train-rmse:0.185659 
## [23] train-rmse:0.174319 
## [24] train-rmse:0.166255 
## [25] train-rmse:0.160776 
## [26] train-rmse:0.156682 
## [27] train-rmse:0.151849 
## [28] train-rmse:0.149243 
## [29] train-rmse:0.147081 
## [30] train-rmse:0.144992 
## [31] train-rmse:0.143502 
## [32] train-rmse:0.142305 
## [33] train-rmse:0.141352 
## [34] train-rmse:0.139594 
## [35] train-rmse:0.138257 
## [36] train-rmse:0.137356 
## [37] train-rmse:0.135985 
## [38] train-rmse:0.133537 
## [39] train-rmse:0.132951 
## [40] train-rmse:0.130554 
## [41] train-rmse:0.129795 
## [42] train-rmse:0.128975 
## [43] train-rmse:0.128113 
## [44] train-rmse:0.127190 
## [45] train-rmse:0.126474 
## [46] train-rmse:0.125018 
## [47] train-rmse:0.123243 
## [48] train-rmse:0.122736 
## [49] train-rmse:0.121889 
## [50] train-rmse:0.121351 
## [51] train-rmse:0.120466 
## [52] train-rmse:0.119956 
## [53] train-rmse:0.119519 
## [54] train-rmse:0.119019 
## [55] train-rmse:0.118506 
## [56] train-rmse:0.118056 
## [57] train-rmse:0.117613 
## [58] train-rmse:0.116505 
## [59] train-rmse:0.115565 
## [60] train-rmse:0.115249 
## [61] train-rmse:0.114792 
## [62] train-rmse:0.114130 
## [63] train-rmse:0.113488 
## [64] train-rmse:0.112393 
## [65] train-rmse:0.112193 
## [66] train-rmse:0.111713 
## [67] train-rmse:0.111415 
## [68] train-rmse:0.110427 
## [69] train-rmse:0.110075 
## [70] train-rmse:0.109098 
## [71] train-rmse:0.108790 
## [72] train-rmse:0.107891 
## [73] train-rmse:0.107494 
## [74] train-rmse:0.107273 
## [75] train-rmse:0.106934 
## [76] train-rmse:0.106121 
## [77] train-rmse:0.105797 
## [78] train-rmse:0.105413 
## [79] train-rmse:0.105112 
## [80] train-rmse:0.104701 
## [81] train-rmse:0.104271 
## [82] train-rmse:0.103689 
## [83] train-rmse:0.102949 
## [84] train-rmse:0.102561 
## [85] train-rmse:0.101933 
## [86] train-rmse:0.101014 
## [87] train-rmse:0.100553 
## [88] train-rmse:0.100383 
## [89] train-rmse:0.100099 
## [90] train-rmse:0.099597 
## [91] train-rmse:0.099309 
## [92] train-rmse:0.098876 
## [93] train-rmse:0.098775 
## [94] train-rmse:0.098658 
## [95] train-rmse:0.098342 
## [96] train-rmse:0.097966 
## [97] train-rmse:0.097734 
## [98] train-rmse:0.097543 
## [99] train-rmse:0.097206 
## [100]    train-rmse:0.096907
xgb.pred.opt <- predict(xgb_opt, newdata = xgb_ig_validation)
# evaluation
xgb.mape <- mape(xgb.pred.opt, validation$price)
xgb.price_error <- mean(abs((exp(xgb.pred.opt) - 1) - (exp(validation$price) - 1)))
my_data <- as.data.frame(cbind(predicted = (exp(xgb.pred.opt) - 1), observed = (exp(validation$price) - 1)))

# plotting the results
ggplot(my_data, aes(predicted, observed)) +
    geom_point() + geom_smooth(method = "glm") +
    labs(x = "Predicted") +
    ggtitle(paste("XGBoost", ' MAPE: ', format(round(xgb.mape, 4), nsmall = 4), ' --> Price ERROR: ', format(round(xgb.price_error, 0), nsmall = 0),
                        ' €', sep = '')) +
    scale_x_continuous(labels = scales::comma) +
    scale_y_continuous(labels = scales::comma)
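For reference, the mape helper used throughout these evaluations is assumed to compute the mean absolute percentage error (it may come from the Metrics package); a minimal base-R sketch of that assumed definition:

```r
# Mean absolute percentage error: mean(|actual - predicted| / |actual|).
# Sketch of the assumed definition; the argument order matches the calls
# in this report, i.e., mape(predictions, actuals).
mape <- function(predicted, actual) {
  mean(abs((actual - predicted) / actual))
}

mape(c(110, 90), c(100, 100))  # both predictions off by 10% -> 0.1
```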

Regression with StepWise feature selection

In this section we apply linear regression with stepwise feature selection (via stepAIC).
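Stepwise selection starts from a full model and iteratively adds or drops terms to minimize the AIC. A small self-contained illustration on the built-in mtcars dataset (not the house data):

```r
# Toy example of stepwise selection with MASS::stepAIC on mtcars.
library(MASS)

full_model <- lm(mpg ~ ., data = mtcars)          # start from the full model
step_model <- stepAIC(full_model, trace = FALSE)  # add/drop terms to minimize AIC

# The predictors retained after the stepwise search:
names(coef(step_model))
```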

# adapting the dataset to work for the algorithm
lm_ig_training <- ig_training
lm_ig_training$yr_built <- as.numeric(lm_ig_training$yr_built)
lm_ig_training$yr_renovated <- as.numeric(lm_ig_training$yr_renovated)

lm_ig_validation <- ig_validation
lm_ig_validation$yr_built <- as.numeric(lm_ig_validation$yr_built)
lm_ig_validation$yr_renovated <- as.numeric(lm_ig_validation$yr_renovated)

#### Regression with StepWise feature selection 
lm_0<-stepAIC(lm(formula = formula, 
                 data=lm_ig_training),
              trace=F)

lm.pred<-predict(lm_0, newdata = lm_ig_validation)

# evaluation
lm.mape <- mape(lm.pred, validation$price)
lm.price_error <- mean(abs((exp(lm.pred) -1) - (exp(validation$price) -1)))
my_data <- as.data.frame(cbind(predicted=(exp(lm.pred) -1), observed=(exp(validation$price) -1)))

# plotting the results
ggplot(my_data, aes(predicted, observed)) +
    geom_point() + geom_smooth(method = "glm") +
    labs(x = "Predicted") +
    ggtitle(paste("StepWise", ' MAPE: ', format(round(lm.mape, 4), nsmall = 4), ' --> Price ERROR: ', format(round(lm.price_error, 0), nsmall = 0),
                        ' €', sep = '')) +
    scale_x_continuous(labels = scales::comma) +
    scale_y_continuous(labels = scales::comma)

Regression with regularization

In this section we apply regression with regularization (the lasso, via glmnet).
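Setting alpha = 1 selects the lasso (L1) penalty, which shrinks small coefficients exactly to zero (the "." entries in the coefficient listing further down). A base-R sketch of the soft-thresholding operator at the heart of the coordinate-descent algorithm glmnet uses (an illustration of the idea, not glmnet's actual code):

```r
# Soft-thresholding: the closed-form single-coefficient update under an L1
# penalty. Values with magnitude below lambda are set exactly to 0.
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

soft_threshold(c(3, -0.5, 1.2), lambda = 1)  # -> 2.0 0.0 0.2
```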

# Defining the model
glmnet_cv<-cv.glmnet(x = xgb_ig_training,
                     nfolds = 5,
                     y = training[['price']],
                     alpha=1,
                     family = 'gaussian',
                     standardize = T)
plot(glmnet_cv)

glmnet_cv$lambda.min
## [1] 0.0001452647
glmnet_0<-glmnet(x = xgb_ig_training, 
                 y = training[['price']],
                 family = 'gaussian',
                 alpha=1, lambda = glmnet_cv$lambda.min)
glmnet_0$beta
## 281 x 1 sparse Matrix of class "dgCMatrix"
##                              s0
## bedrooms          -1.479584e-03
## bathrooms          3.492611e-02
## sqft_living        9.468635e-05
## sqft_lot           7.759506e-02
## floors            -3.451766e-02
## waterfront         6.222213e-01
## view_1             1.186810e-01
## view_2             1.107920e-01
## view_3             1.917645e-01
## view_4             2.860010e-01
## condition_2       -2.432024e-03
## condition_3        1.838034e-01
## condition_4        2.398081e-01
## condition_5        3.075722e-01
## sqft_above         1.303086e-04
## sqft_basement      8.405461e-03
## yr_built_1901     -1.468549e-02
## yr_built_1902     -5.093923e-02
## yr_built_1903      4.294774e-02
## yr_built_1904      4.104774e-02
## yr_built_1905      4.051628e-02
## yr_built_1906      2.845928e-02
## yr_built_1907      6.363098e-02
## yr_built_1908     -5.487049e-02
## yr_built_1909      5.907798e-02
## yr_built_1910      1.410455e-02
## yr_built_1911      8.108415e-02
## yr_built_1912      5.098234e-03
## yr_built_1913      4.978905e-02
## yr_built_1914      .           
## yr_built_1915      6.058584e-02
## yr_built_1916     -1.515324e-03
## yr_built_1917      .           
## yr_built_1918     -3.724505e-03
## yr_built_1919      2.136295e-02
## yr_built_1920     -1.843906e-02
## yr_built_1921      1.924526e-02
## yr_built_1922      2.309567e-02
## yr_built_1923      6.119913e-02
## yr_built_1924      2.522968e-02
## yr_built_1925      1.477596e-02
## yr_built_1926      6.855579e-03
## yr_built_1927      1.887382e-02
## yr_built_1928      4.027681e-02
## yr_built_1929      1.897827e-02
## yr_built_1930      8.829040e-03
## yr_built_1931      4.761113e-02
## yr_built_1932      8.683536e-02
## yr_built_1933      6.213484e-03
## yr_built_1934      7.460732e-03
## yr_built_1935     -1.135776e-01
## yr_built_1936     -1.159552e-02
## yr_built_1937      2.998888e-02
## yr_built_1938      2.610192e-02
## yr_built_1939      2.424992e-02
## yr_built_1940      8.259341e-04
## yr_built_1941      1.760638e-02
## yr_built_1942     -9.915578e-03
## yr_built_1943     -5.453822e-02
## yr_built_1944     -2.079697e-02
## yr_built_1945     -6.168704e-02
## yr_built_1946      1.204894e-02
## yr_built_1947     -3.735979e-02
## yr_built_1948     -6.895406e-03
## yr_built_1949      .           
## yr_built_1950     -2.480758e-02
## yr_built_1951     -4.851633e-02
## yr_built_1952     -1.605810e-02
## yr_built_1953     -4.283819e-02
## yr_built_1954     -6.081894e-02
## yr_built_1955     -6.173728e-02
## yr_built_1956     -2.021930e-02
## yr_built_1957     -4.418181e-02
## yr_built_1958     -5.109200e-02
## yr_built_1959     -5.536560e-02
## yr_built_1960     -6.986957e-02
## yr_built_1961     -7.916205e-02
## yr_built_1962     -7.424556e-02
## yr_built_1963     -4.488380e-02
## yr_built_1964     -9.625955e-02
## yr_built_1965     -7.459859e-02
## yr_built_1966     -9.407850e-02
## yr_built_1967     -6.071106e-02
## yr_built_1968     -4.571427e-02
## yr_built_1969     -4.517778e-02
## yr_built_1970     -4.828831e-02
## yr_built_1971     -5.571726e-02
## yr_built_1972     -5.614270e-02
## yr_built_1973     -3.447103e-02
## yr_built_1974     -5.961803e-02
## yr_built_1975     -6.862180e-02
## yr_built_1976     -4.299703e-02
## yr_built_1977     -6.210875e-02
## yr_built_1978     -8.989559e-02
## yr_built_1979     -8.055785e-02
## yr_built_1980     -4.233637e-02
## yr_built_1981     -1.230592e-02
## yr_built_1982     -2.255233e-02
## yr_built_1983      .           
## yr_built_1984     -2.461431e-02
## yr_built_1985     -1.569808e-02
## yr_built_1986     -4.954929e-02
## yr_built_1987     -3.471095e-02
## yr_built_1988     -3.726636e-02
## yr_built_1989     -1.261129e-02
## yr_built_1990     -3.307039e-02
## yr_built_1991     -1.088332e-02
## yr_built_1992      3.377267e-03
## yr_built_1993      2.696437e-02
## yr_built_1994     -1.313832e-02
## yr_built_1995      5.191167e-03
## yr_built_1996      2.377057e-02
## yr_built_1997     -5.698978e-03
## yr_built_1998     -5.068217e-03
## yr_built_1999      2.879884e-02
## yr_built_2000      3.452288e-02
## yr_built_2001      4.662907e-02
## yr_built_2002      2.686386e-02
## yr_built_2003      1.789817e-02
## yr_built_2004      2.549285e-02
## yr_built_2005      .           
## yr_built_2006      2.061399e-02
## yr_built_2007      3.156781e-03
## yr_built_2008      3.541296e-02
## yr_built_2009      6.133105e-02
## yr_built_2010      7.146815e-02
## yr_built_2011      7.666879e-02
## yr_built_2012      9.991995e-02
## yr_built_2013      9.921345e-02
## yr_built_2014      1.051839e-01
## yr_built_2015      1.838133e-01
## yr_renovated_1934  1.383457e-01
## yr_renovated_1940 -2.177388e-01
## yr_renovated_1944  .           
## yr_renovated_1945  4.123770e-02
## yr_renovated_1946  2.402046e-01
## yr_renovated_1948 -1.502751e-02
## yr_renovated_1950 -2.505284e-01
## yr_renovated_1951  .           
## yr_renovated_1953 -1.095507e-01
## yr_renovated_1954  4.067528e-01
## yr_renovated_1955 -8.110347e-02
## yr_renovated_1956 -1.914145e-01
## yr_renovated_1957 -2.722276e-02
## yr_renovated_1958  2.159922e-02
## yr_renovated_1959 -2.385730e-01
## yr_renovated_1960 -1.890948e-01
## yr_renovated_1962 -2.716533e-01
## yr_renovated_1963 -3.283154e-01
## yr_renovated_1964 -6.907815e-03
## yr_renovated_1965 -4.605112e-02
## yr_renovated_1967  1.285162e-01
## yr_renovated_1968 -1.451889e-01
## yr_renovated_1969 -1.143295e-01
## yr_renovated_1970 -1.075913e-01
## yr_renovated_1971  .           
## yr_renovated_1972  .           
## yr_renovated_1973 -2.568962e-01
## yr_renovated_1974 -3.631841e-01
## yr_renovated_1975  9.353236e-02
## yr_renovated_1976 -2.497798e-01
## yr_renovated_1977  1.102054e-01
## yr_renovated_1978  2.385313e-01
## yr_renovated_1979  6.338237e-02
## yr_renovated_1980 -4.233997e-02
## yr_renovated_1981  4.605575e-02
## yr_renovated_1982 -3.267275e-02
## yr_renovated_1983  6.865881e-02
## yr_renovated_1984  .           
## yr_renovated_1985 -4.184842e-02
## yr_renovated_1986  1.780231e-01
## yr_renovated_1987 -4.196133e-03
## yr_renovated_1988  5.186214e-02
## yr_renovated_1989  1.396939e-01
## yr_renovated_1990  1.935327e-02
## yr_renovated_1991  1.467447e-01
## yr_renovated_1992 -2.443194e-04
## yr_renovated_1993  1.004962e-01
## yr_renovated_1994  1.485293e-01
## yr_renovated_1995  9.070973e-03
## yr_renovated_1996  2.891881e-02
## yr_renovated_1997 -4.261392e-02
## yr_renovated_1998  7.090450e-02
## yr_renovated_1999  1.442544e-01
## yr_renovated_2000  5.223952e-02
## yr_renovated_2001  8.942797e-02
## yr_renovated_2002  3.374476e-01
## yr_renovated_2003  1.017245e-01
## yr_renovated_2004  1.781804e-01
## yr_renovated_2005  7.459856e-02
## yr_renovated_2006  7.129004e-02
## yr_renovated_2007  1.535018e-01
## yr_renovated_2008  2.092712e-01
## yr_renovated_2009  2.511497e-01
## yr_renovated_2010  1.726380e-01
## yr_renovated_2011  6.491641e-02
## yr_renovated_2012  5.840516e-02
## yr_renovated_2013  1.328875e-01
## yr_renovated_2014  1.292709e-01
## yr_renovated_2015 -2.726982e-02
## zipcode_98002     -5.155605e-02
## zipcode_98003     -2.519797e-02
## zipcode_98004      7.757979e-01
## zipcode_98005      3.990207e-01
## zipcode_98006      3.807750e-01
## zipcode_98007      3.368283e-01
## zipcode_98008      3.647714e-01
## zipcode_98010      1.948504e-01
## zipcode_98011     -7.018574e-02
## zipcode_98014     -6.876293e-02
## zipcode_98019     -1.432063e-01
## zipcode_98022      1.527692e-01
## zipcode_98023     -7.342754e-02
## zipcode_98024      1.758253e-01
## zipcode_98027      2.905419e-01
## zipcode_98028     -1.055415e-01
## zipcode_98029      3.794873e-01
## zipcode_98030     -2.746105e-02
## zipcode_98031     -3.822223e-02
## zipcode_98032     -8.767720e-02
## zipcode_98033      3.664799e-01
## zipcode_98034      8.833171e-02
## zipcode_98038      9.695804e-02
## zipcode_98039      8.700714e-01
## zipcode_98040      5.870704e-01
## zipcode_98042     -5.469124e-03
## zipcode_98045      2.133428e-01
## zipcode_98052      2.394662e-01
## zipcode_98053      1.711794e-01
## zipcode_98055     -4.745051e-02
## zipcode_98056      9.268604e-02
## zipcode_98058      1.506511e-02
## zipcode_98059      1.144108e-01
## zipcode_98065      1.976417e-01
## zipcode_98070      6.052206e-02
## zipcode_98072     -1.907643e-02
## zipcode_98074      2.173386e-01
## zipcode_98075      2.591786e-01
## zipcode_98077     -5.483669e-02
## zipcode_98092     -1.508845e-03
## zipcode_98102      5.900410e-01
## zipcode_98103      3.804228e-01
## zipcode_98105      5.312689e-01
## zipcode_98106      5.579203e-02
## zipcode_98107      4.177461e-01
## zipcode_98108      4.918535e-02
## zipcode_98109      6.101729e-01
## zipcode_98112      6.640970e-01
## zipcode_98115      3.771884e-01
## zipcode_98116      4.061558e-01
## zipcode_98117      3.444787e-01
## zipcode_98118      1.791612e-01
## zipcode_98119      5.849446e-01
## zipcode_98122      4.504682e-01
## zipcode_98125      8.844400e-02
## zipcode_98126      2.319060e-01
## zipcode_98133     -4.656203e-02
## zipcode_98136      3.537941e-01
## zipcode_98144      3.245858e-01
## zipcode_98146     -1.130365e-02
## zipcode_98148     -3.336120e-02
## zipcode_98155     -1.124332e-01
## zipcode_98166      1.079610e-01
## zipcode_98168     -1.615768e-01
## zipcode_98177      7.960884e-02
## zipcode_98178     -7.217630e-02
## zipcode_98188     -7.143033e-02
## zipcode_98198     -5.456456e-02
## zipcode_98199      4.319017e-01
## lat                1.102685e+00
## long              -2.756904e-01
## sqft_living15      6.929315e-05
## sqft_lot15        -1.671884e-03
## TotalSqFeet        2.051271e-06
## grade_agg_8        7.415099e-02
## grade_agg_9        1.448697e-01
## grade_agg_6       -1.183688e-01
## grade_agg_10       2.055613e-01
## grade_agg_11       3.524207e-01
## grade_agg_5       -2.393488e-01
## grade_agg_others   4.068964e-01
glmnet.pred <- predict(glmnet_0, newx = xgb_ig_validation)

# Evaluation
glmnet.mape <- mape(as.numeric(glmnet.pred), validation$price)
glmnet.price_error <- mean(abs((exp(glmnet.pred) - 1) - (exp(validation$price) - 1)))
my_data <- as.data.frame(cbind(predicted = as.numeric(exp(glmnet.pred) - 1), observed = (exp(validation$price) - 1)))

# Plotting the results
ggplot(my_data, aes(predicted, observed)) +
    geom_point() + geom_smooth(method = "glm") +
    labs(x = "Predicted") +
    ggtitle(paste("Regression with regularization", ' MAPE: ', format(round(glmnet.mape, 4), nsmall = 4), ' --> Price ERROR: ', format(round(glmnet.price_error, 0), nsmall = 0),
                        ' €', sep = '')) +
    scale_x_continuous(labels = scales::comma) +
    scale_y_continuous(labels = scales::comma)

Final Submission

We split the original training data into training and validation sets to evaluate the candidate models. To generate the final submission, however, we should use all the data at our disposal.

In addition, since we applied a log transformation to the target variable, the exp function must be used to revert it.
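The exp(...) - 1 expressions in the evaluation code suggest the target was transformed with log(price + 1); under that assumption the inverse is exp(pred) - 1 (base R's expm1), as a quick round-trip check illustrates:

```r
# Round trip: log1p transforms the target, expm1 (= exp(x) - 1) reverts it.
prices <- c(221900, 538000, 7700000)
transformed <- log1p(prices)        # what the model is trained on
recovered   <- expm1(transformed)   # what we submit
all.equal(recovered, prices)        # -> TRUE
```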

To produce my prediction I compared all the models explained above; the best one is the optimized XGBoost.

# Use the optimized XGBoost model as the final model
final.model <- xgb_opt

# Predict the prices for the test data, using exp(...) - 1 to revert the log transformation applied to the target variable
final.pred <- as.numeric(exp(predict(final.model, xgb_ig_test)) - 1)
# Sanity check: there should be no NA predictions
final.pred[is.na(final.pred)]
## numeric(0)
hist(final.pred, main="Histogram of Predictions", xlab = "Predictions")

xgb_submission <- data.frame(Id = original_test_data$id, price = final.pred)
write.csv(xgb_submission, file = "submission1.csv", row.names = FALSE)